AITopics | mispronunciation detection

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

Tu, Huu Tuong, Khanh, Ha Viet, Dat, Tran Tien, Huan, Vu, Van Luong, Thien, Cuong, Nguyen Tien, Trang, Nguyen Thi Thu

arXiv.org Artificial Intelligence, Nov-26-2025

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pre-trained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

Index Terms: mispronunciation detection and diagnosis, retrieval-based methods, training-free framework, automatic pronunciation assessment

Introduction: Mispronunciation Detection and Diagnosis is a fundamental task in Computer-Assisted Pronunciation Training (CAPT).

artificial intelligence, machine learning, natural language, (16 more...)

2511.20107

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Research Report (0.64)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition

Kucukmanisa, Ayhan, Gelmez, Derya, Calik, Sukru Selim, Kilimci, Zeynep Hilal

arXiv.org Artificial Intelligence, Nov-24-2025

Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and evaluated on two datasets containing 29 Arabic phonemes, including eight hafiz sounds, articulated by 11 native speakers. Additional speech samples collected from publicly available YouTube recordings were incorporated to enhance data diversity and generalization. Model performance was assessed using standard evaluation metrics: accuracy, precision, recall, and F1-score, allowing a detailed comparison of the fusion strategies. Experimental findings show that the UniSpeech-BERT multimodal configuration provides strong results and that fusion-based transformer architectures are effective for phoneme-level mispronunciation detection. The study contributes to the development of intelligent, speaker-independent, and multimodal Computer-Aided Language Learning (CALL) systems, offering a practical step toward technology-supported Quranic pronunciation training and broader speech-based educational applications.

artificial intelligence, machine learning, natural language, (15 more...)

2511.17477

Country:

Asia > Middle East > Republic of Türkiye (0.04)
Europe > Switzerland (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

Wang, Ke, Wei, Wenning, Deng, Yan, He, Lei, Zhao, Sheng

arXiv.org Artificial Intelligence, Sep-22-2025

Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted Language Learning (CALL), requiring evaluation across multiple granularities and aspects. Large Multimodal Models (LMMs) present new opportunities for APA, but their effectiveness in fine-grained assessment remains uncertain. This work investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a private corpus. Fine-tuning significantly outperforms zero-shot settings and achieves competitive results on single-granularity tasks compared to public and commercial systems. The model performs well at word and sentence levels, while phoneme-level assessment remains challenging. We also observe that the Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman's rank Correlation Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects ordinal consistency. These findings highlight both the promise and limitations of LMMs for APA and point to future work on fine-grained modeling and rank-aware evaluation.
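The gap between PCC and SCC mentioned in this abstract can be illustrated with a small sketch. The scores below are invented for the illustration and are not taken from the paper; the function names are likewise assumptions. The example builds two clusters of scores that are far apart (which drives Pearson close to 1) but reversed in order inside each cluster (which pulls Spearman down).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human and model scores (not from the paper).
human = [1.0, 1.1, 1.2, 1.3, 1.4, 9.0, 9.1, 9.2, 9.3, 9.4]
model = [1.4, 1.3, 1.2, 1.1, 1.0, 9.4, 9.3, 9.2, 9.1, 9.0]

pcc = pearson(human, model)
scc = spearman(human, model)
print(f"PCC={pcc:.3f}  SCC={scc:.3f}")
```

On data like this, the linear correlation is dominated by the distance between the two clusters, while the rank correlation still registers the ordering errors within each cluster.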

large language model, machine learning, natural language, (18 more...)

2509.15701

Country: Asia > China > Beijing > Beijing (0.40)

Genre: Research Report > New Finding (0.47)

Industry: Education > Curriculum > Subject-Specific Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

Evaluating Logit-Based GOP Scores for Mispronunciation Detection

Parikh, Aditya Kamlesh, Tejedor-Garcia, Cristian, Cucchiarini, Catia, Strik, Helmer

arXiv.org Artificial Intelligence, Sep-1-2025

Pronunciation assessment relies on goodness of pronunciation (GOP) scores, traditionally derived from softmax-based posterior probabilities. However, posterior probabilities may suffer from overconfidence and poor phoneme separation, limiting their effectiveness. This study compares logit-based GOP scores with probability-based GOP scores for mispronunciation detection. We conducted our experiment on two L2 English speech datasets spoken by Dutch and Mandarin speakers, assessing classification performance and correlation with human ratings. Logit-based methods outperform probability-based GOP in classification, but their effectiveness depends on dataset characteristics. The maximum logit GOP shows the strongest alignment with human perception, while a combination of different GOP scores balances probability and logit features. The findings suggest that hybrid GOP methods incorporating uncertainty modeling and phoneme-specific weighting improve pronunciation assessment.
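As a rough illustration of the contrast drawn in this abstract, the sketch below computes a probability-based score and a maximum-logit score for one phoneme segment. The abstract does not give the exact formulas, so both definitions here are assumptions based on common formulations (mean log posterior of the target phoneme, and the largest raw logit of the target phoneme); the function names and the toy logits are hypothetical.

```python
from math import exp, log

def log_softmax(logits):
    """Numerically stable log-softmax over one frame of logits."""
    m = max(logits)
    lse = m + log(sum(exp(v - m) for v in logits))
    return [v - lse for v in logits]

def gop_probability(frame_logits, target):
    """Assumed probability-based GOP: mean log posterior of the target phoneme."""
    total = sum(log_softmax(frame)[target] for frame in frame_logits)
    return total / len(frame_logits)

def gop_max_logit(frame_logits, target):
    """Assumed logit-based GOP: largest pre-softmax logit of the target phoneme."""
    return max(frame[target] for frame in frame_logits)

# Toy segment: three frames, three phoneme classes (made-up values).
segment = [
    [2.0, 0.5, -1.0],
    [3.0, 0.0, 0.0],
    [1.0, 1.5, 0.2],
]
target_phoneme = 0

prob_score = gop_probability(segment, target_phoneme)
logit_score = gop_max_logit(segment, target_phoneme)
print(prob_score, logit_score)
```

The probability-based score is always at most zero because it averages log probabilities, whereas the logit-based score is unbounded and is not squashed by the softmax normalization.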

artificial intelligence, machine learning, speech recognition, (14 more...)

doi: 10.21437/Interspeech.2025-1012

2506.12067

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Netherlands (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)

Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish

Phan, Nhan, Kuronen, Mikko, Kautonen, Maria, Ullakonoja, Riikka, von Zansen, Anna, Getman, Yaroslav, Voskoboinik, Ekaterina, Grósz, Tamás, Kurimo, Mikko

arXiv.org Artificial Intelligence, Aug-21-2025

Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers' spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after the inference to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model's Recall (77.5%) and Precision (17.6%).
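The post-processing chain named in this abstract (temperature scaling followed by top-k normalization) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the temperature value, k, and the logits are all assumptions. The small `f1` helper applies the standard harmonic-mean formula to the precision and recall figures reported above.

```python
from math import exp

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T > 1 flattens the output."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_normalize(probs, k):
    """Keep the k largest probabilities and renormalize them to sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Made-up logits for one frame over five phoneme classes.
logits = [4.0, 2.5, 1.0, 0.5, -1.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
top3 = top_k_normalize(flat, k=3)

# F1 derived from the precision/recall pairs quoted in the abstract.
f1_proposed = f1(0.298, 0.432)
f1_baseline = f1(0.176, 0.775)
print(max(sharp), max(flat), top3)
print(f1_proposed, f1_baseline)
```

With the reported figures, F1 works out to roughly 0.35 for the proposed model versus roughly 0.29 for the baseline, which is one way to read the claim that the method balances recall and precision.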

artificial intelligence, machine learning, pronunciation, (16 more...)

doi: 10.21437/Interspeech.2025-1375

2506.01156

Country:

Europe > Sweden (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)

Segmentation-free Goodness of Pronunciation

Cao, Xinwei, Fan, Zijian, Svendsen, Torbjørn, Salvi, Giampiero

arXiv.org Artificial Intelligence, Jul-30-2025

Mispronunciation detection and diagnosis (MDD) is a significant part of modern computer-aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues, and a proper normalization that makes the method applicable to acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets, comparing the different definitions of our methods and estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the Speechocean762 data, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
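The idea of taking all possible alignments into account is the core of the standard CTC forward recursion, which the sketch below implements in the log domain to avoid underflow. This is only the generic CTC likelihood, written under the assumption that index 0 is the blank symbol; it is not the paper's GOP-AF definition or its normalization, and the function names are hypothetical.

```python
from math import exp, inf, log

def logsumexp(*vals):
    """log(sum(exp(v))) computed stably; returns -inf if every input is -inf."""
    m = max(vals)
    if m == -inf:
        return -inf
    return m + log(sum(exp(v - m) for v in vals))

def ctc_log_likelihood(log_probs, target, blank=0):
    """log P(target | frames), summed over every valid CTC alignment.

    log_probs: list of frames, each a list of per-symbol log-probabilities.
    target: list of symbol indices, without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    size = len(ext)

    # Initialization: a path may start in the first blank or the first symbol.
    alpha = [-inf] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [-inf] * size
        for s in range(size):
            cands = [alpha[s]]                      # stay on the same state
            if s >= 1:
                cands.append(alpha[s - 1])          # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip the blank between symbols
            new[s] = logsumexp(*cands) + frame[ext[s]]
        alpha = new

    # Valid paths end in the last symbol or the trailing blank.
    if size == 1:
        return alpha[0]
    return logsumexp(alpha[-1], alpha[-2])

# Two frames, vocabulary {blank, 'a'}, uniform probabilities (toy values).
frames = [[log(0.5), log(0.5)], [log(0.5), log(0.5)]]
likelihood = exp(ctc_log_likelihood(frames, [1]))
print(likelihood)
```

In the toy case the three alignments "a a", "a blank", and "blank a" each have probability 0.25, so the summed likelihood of the single-symbol target is 0.75.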

artificial intelligence, machine learning, natural language, (19 more...)

2507.16838

Country: Europe > Norway (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(3 more...)

JCAPT: A Joint Modeling Approach for CAPT

Yang, Tzu-Hsuan, He, Yue-Yang, Chen, Berlin

arXiv.org Artificial Intelligence, Jul-28-2025

Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.

artificial intelligence, machine learning, natural language, (17 more...)

2506.19315

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.40)
Europe > Greece (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.70)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.48)
(3 more...)

Towards Efficient and Multifaceted Computer-assisted Pronunciation Training Leveraging Hierarchical Selective State Space Model and Decoupled Cross-entropy Loss

Chao, Fu-An, Chen, Berlin

arXiv.org Artificial Intelligence, Feb-11-2025

Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts: the former aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while the latter focuses instead on pinpointing the precise phonetic pronunciation errors made by non-native language learners. However, it is generally expected that a full-fledged CAPT system should perform both functionalities simultaneously and efficiently. In response to this surging demand, we in this work first propose HMamba, a novel CAPT approach that seamlessly integrates APA and MDD tasks in parallel. In addition, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored for MDD to facilitate better-supervised learning for detecting mispronounced phones, thereby enhancing overall performance. A comprehensive set of empirical results on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach on APA. Notably, our proposed approach also yields a considerable improvement in MDD performance over a strong baseline, achieving an F1-score of 63.85%. Our codes are made available at https://github.com/Fuann/hmamba

assessment, machine learning, natural language, (15 more...)

2502.07575

Country:

Asia > Taiwan (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Massachusetts (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children

Berriche, Lamia, Driss, Maha, Almuntashri, Areej Ahmed, Lghabi, Asma Mufreh, Almudhi, Heba Saleh, Almansour, Munerah Abdul-Aziz

arXiv.org Artificial Intelligence, Nov-18-2024

This paper introduces a new application named ArPA for Arabic-speaking children who have trouble with pronunciation. Our application comprises two key components: the diagnostic module and the therapeutic module. The diagnostic process involves capturing the child's speech signal, preprocessing it, and analyzing it using different machine learning classifiers such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Decision Trees, as well as deep neural network classifiers such as ResNet18. The therapeutic module offers eye-catching gamified interfaces in which each correctly spoken letter earns a higher avatar level, providing positive reinforcement for the child's pronunciation improvement. Two datasets were used for experimental evaluation: one from a childcare centre and the other including Arabic alphabet pronunciation recordings. Our work uses a novel technique for speech recognition using Mel-spectrogram and MFCC images. The results show that the ResNet18 classifier on speech-to-image converted data effectively identifies mispronunciations in Arabic speech, reaching an accuracy of 99.015% with Mel-spectrogram images and outperforming ResNet18 with MFCC images.

artificial intelligence, machine learning, pronunciation, (18 more...)

2411.13592

Country:

Asia > Middle East > Saudi Arabia > Riyadh Province > Riyadh (0.05)
Africa > Middle East > Tunisia > Manouba Governorate > Manouba (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.48)

Industry:

Education (0.68)
Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis

Wang, Xintong, Shi, Mingqian, Wang, Ye

arXiv.org Artificial Intelligence, Jun-6-2024

Mispronunciation Detection and Diagnosis (MDD) systems, leveraging Automatic Speech Recognition (ASR), face two main challenges in Mandarin Chinese: 1) The two-stage models create an information gap between the phoneme or tone classification stage and the MDD stage. …

Subsequently, Zhang et al. [1] adopted an autoregressive model, the Recurrent Neural Network Transducer (RNN-T) [9], for MDD. This approach aims to capture the temporal dependence of mispronunciation patterns, showing better performance than Connectionist Temporal Classification …

diagnosis, pitch encoder, pitch fusion block, (11 more...)

2406.04595

Country: Asia > Singapore > Central Region > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
